## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_integer(),
## `Life Ladder` = col_double(),
## `Log GDP per capita` = col_double(),
## `Social support` = col_double(),
## `Healthy life expectancy at birth` = col_double(),
## `Freedom to make life choices` = col_double(),
## Generosity = col_double(),
## `Perceptions of corruption` = col_double(),
## `Positive affect` = col_double(),
## `Negative affect` = col_double(),
## `Standard deviation of ladder by country-year` = col_double(),
## `Standard deviation/Mean of ladder by country-year` = col_double()
## )
This report examines possible contributing factors to the Happiness of a nation’s populace (“AVGHappiness”), rated on a scale of 1-10. All variables are national averages, mostly of survey responses, with the exception of “country”, “year”, “HappinessSD”, and “HappinessSD/mean”. The primary goal of this report is to determine the most predictive measure of average national Happiness.
The response variable looks either noisy or multi-modal, so it should be tested for multi-modality using Hartigans’ dip test. The box plot indicates that if there is bi-modality, it does not strongly affect the distribution.
H0: The variable is uni-modal. Ha: The variable is multi-modal. Significance level: α = 0.10 (90% confidence).
Result: The test’s p-value of 0.9299 is far above 0.10, so we fail to reject H0. There is no significant evidence of multi-modality.
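The decision rule above can be sketched in R as follows. This is a minimal illustration, not the report's own code: the `diptest` package is an assumption (the report does not show which function was used), the data are simulated stand-ins for AVGHappiness, and `dip_decision` is a hypothetical helper.

```r
# Simulated stand-in for the response; the real AVGHappiness column is not reproduced here.
set.seed(1)
happiness <- rnorm(500, mean = 5.4, sd = 1.1)

# Decision rule from the text: reject H0 (uni-modality) when p < alpha.
dip_decision <- function(p_value, alpha = 0.10) {
  if (p_value < alpha) "reject H0: evidence of multi-modality" else "fail to reject H0"
}

# diptest is an assumption; the call is skipped if the package is not installed.
if (requireNamespace("diptest", quietly = TRUE)) {
  res <- diptest::dip.test(happiness)
  print(dip_decision(res$p.value))
}

# Applying the rule to the p-value reported above.
dip_decision(0.9299)
```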
## year AVGHappiness LogGDPPC SocialSupport LifeExpectancy
## year 1.00 0.00 0.05 -0.02 0.11
## AVGHappiness 0.00 1.00 0.78 0.69 0.73
## LogGDPPC 0.05 0.78 1.00 0.66 0.85
## SocialSupport -0.02 0.69 0.66 1.00 0.57
## LifeExpectancy 0.11 0.73 0.85 0.57 1.00
## Freedom 0.12 0.52 0.36 0.43 0.33
## Generosity -0.02 0.23 -0.01 0.09 0.05
## AVGIsCorrupt -0.05 -0.44 -0.35 -0.22 -0.32
## PosAffect 0.00 0.56 0.32 0.48 0.30
## NegAffect 0.14 -0.24 -0.11 -0.32 -0.06
## HappinessSD 0.23 -0.11 -0.05 -0.12 -0.01
## HappinessCV 0.18 -0.75 -0.55 -0.57 -0.51
## Freedom Generosity AVGIsCorrupt PosAffect NegAffect
## year 0.12 -0.02 -0.05 0.00 0.14
## AVGHappiness 0.52 0.23 -0.44 0.56 -0.24
## LogGDPPC 0.36 -0.01 -0.35 0.32 -0.11
## SocialSupport 0.43 0.09 -0.22 0.48 -0.32
## LifeExpectancy 0.33 0.05 -0.32 0.30 -0.06
## Freedom 1.00 0.36 -0.50 0.62 -0.31
## Generosity 0.36 1.00 -0.29 0.42 -0.16
## AVGIsCorrupt -0.50 -0.29 1.00 -0.30 0.28
## PosAffect 0.62 0.42 -0.30 1.00 -0.35
## NegAffect -0.31 -0.16 0.28 -0.35 1.00
## HappinessSD -0.10 -0.19 0.30 -0.03 0.48
## HappinessCV -0.38 -0.22 0.40 -0.40 0.47
## HappinessSD HappinessCV
## year 0.23 0.18
## AVGHappiness -0.11 -0.75
## LogGDPPC -0.05 -0.55
## SocialSupport -0.12 -0.57
## LifeExpectancy -0.01 -0.51
## Freedom -0.10 -0.38
## Generosity -0.19 -0.22
## AVGIsCorrupt 0.30 0.40
## PosAffect -0.03 -0.40
## NegAffect 0.48 0.47
## HappinessSD 1.00 0.69
## HappinessCV 0.69 1.00
The pairs plot appears to show a non-linear relationship between AVGIsCorrupt and every other variable. The strongest relationship in the correlation matrix is that of LifeExpectancy~LogGDPPC (0.85). The strongest correlations with the response variable, in order of absolute magnitude, are LogGDPPC (0.78), HappinessCV (-0.75), LifeExpectancy (0.73), and SocialSupport (0.69).
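As a quick check on that ranking, the AVGHappiness row of the correlation matrix can be sorted by absolute value. The numbers below are copied from the output above; the variable names `r` and `ranked` are only illustrative.

```r
# Correlations with AVGHappiness, copied from the matrix above.
r <- c(year = 0.00, LogGDPPC = 0.78, SocialSupport = 0.69, LifeExpectancy = 0.73,
       Freedom = 0.52, Generosity = 0.23, AVGIsCorrupt = -0.44, PosAffect = 0.56,
       NegAffect = -0.24, HappinessSD = -0.11, HappinessCV = -0.75)

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- r[order(abs(r), decreasing = TRUE)]
head(ranked, 4)
```

Sorting on the signed values instead would push HappinessCV to the bottom of the list, even though it is the second-strongest linear relationship.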
A first-order linear model with all eleven predictors.
##
## Call:
## lm(formula = AVGHappiness ~ year + LogGDPPC + SocialSupport +
## LifeExpectancy + Freedom + Generosity + AVGIsCorrupt + PosAffect +
## NegAffect + HappinessSD + HappinessCV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01254 -0.16045 -0.03552 0.11895 1.56132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.952600 5.427304 -1.465 0.143096
## year 0.005891 0.002711 2.173 0.029953 *
## LogGDPPC 0.163516 0.014164 11.545 < 2e-16 ***
## SocialSupport 0.246433 0.100373 2.455 0.014220 *
## LifeExpectancy -0.001467 0.001818 -0.807 0.419803
## Freedom 0.130785 0.075853 1.724 0.084922 .
## Generosity 0.533880 0.057208 9.332 < 2e-16 ***
## AVGIsCorrupt -0.738123 0.052732 -13.998 < 2e-16 ***
## PosAffect 0.374223 0.108272 3.456 0.000566 ***
## NegAffect 0.459225 0.125829 3.650 0.000274 ***
## HappinessSD 2.014199 0.042229 47.697 < 2e-16 ***
## HappinessCV -10.422986 0.168518 -61.851 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2719 on 1231 degrees of freedom
## Multiple R-squared: 0.9429, Adjusted R-squared: 0.9423
## F-statistic: 1847 on 11 and 1231 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: AVGHappiness
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 0.00 0.00 0.0001 0.99269
## LogGDPPC 1 967.27 967.27 13085.8630 < 2e-16 ***
## SocialSupport 1 86.24 86.24 1166.6867 < 2e-16 ***
## LifeExpectancy 1 29.61 29.61 400.6112 < 2e-16 ***
## Freedom 1 66.96 66.96 905.8303 < 2e-16 ***
## Generosity 1 24.79 24.79 335.3538 < 2e-16 ***
## AVGIsCorrupt 1 8.16 8.16 110.3958 < 2e-16 ***
## PosAffect 1 35.34 35.34 478.1239 < 2e-16 ***
## NegAffect 1 0.02 0.02 0.3046 0.58113
## HappinessSD 1 0.21 0.21 2.8438 0.09198 .
## HappinessCV 1 282.77 282.77 3825.5481 < 2e-16 ***
## Residuals 1231 90.99 0.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At the 0.05 level, the analysis of variance table suggests that all but year, NegAffect, and HappinessSD are significant predictors. Because these are sequential tests, each term is assessed only after the terms listed above it. The coefficient tests, which adjust each term for all the others, suggest that all but LifeExpectancy and Freedom are significant at the 0.05 level. The R-squared is 0.9429, with an adjusted R-squared of 0.9423, which indicates that most of the variability in AVGHappiness is explained by this model. The residual standard error is 0.2719 on the AVGHappiness scale, which is small relative to the range of AVGHappiness values (2.70 to 7.97).
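The two tables disagree because they test different things. A small sketch on simulated data (the variables `x1`, `x2`, and `y` are made up for illustration) shows that the sequential ANOVA and the coefficient t-tests only coincide for the last term entered:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

coef_tab  <- summary(fit)$coefficients   # each term adjusted for all others
anova_tab <- anova(fit)                  # each term adjusted only for those before it

# For the last term entered, the two tests coincide: F = t^2.
t_last <- coef_tab["x2", "t value"]
f_last <- anova_tab["x2", "F value"]
all.equal(f_last, t_last^2)

# Reversing the order changes the sequential sums of squares for x1 and x2.
anova(lm(y ~ x2 + x1))
```

With correlated predictors such as LogGDPPC and LifeExpectancy, the order of terms in the formula can therefore change which rows of the ANOVA table look significant.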
Residual analysis suggests that there is a curvature effect missing from the model or else a transformation is needed. The Box-Cox method suggests…
The Box-Cox analysis suggests an inverse power transformation, with \(\lambda=-0.775\).
The Box-Cox plot for fit2 indicates little or no further transformation is needed, as the confidence interval includes 1.
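A minimal sketch of the Box-Cox workflow, assuming `MASS::boxcox` was the function used and that the transformed response was computed as \((y^\lambda - 1)/\lambda\); the report shows neither, and the data and the helper `box_cox` here are hypothetical.

```r
library(MASS)  # provides boxcox(); assumed to be the function used in the report

set.seed(7)
d <- data.frame(x = runif(300, 0, 2))
d$y <- exp(1 + 0.5 * d$x + rnorm(300, sd = 0.2))   # positive, right-skewed response

# y = TRUE stores the response in the fit so boxcox() does not need to refit.
fit <- lm(y ~ x, data = d, y = TRUE)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.025), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # lambda maximising the profile log-likelihood

# Box-Cox transform; the log is the limiting case at lambda = 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

d$bc_y <- box_cox(d$y, lambda_hat)
fit_bc <- lm(bc_y ~ x, data = d)
lambda_hat
```

Running `boxcox()` again on the refitted model is the check described in the text: if the interval around the new maximum includes 1, no further transformation is indicated.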
##
## Call:
## lm(formula = bcAVGHappiness ~ year + LogGDPPC + SocialSupport +
## LifeExpectancy + Freedom + Generosity + AVGIsCorrupt + PosAffect +
## NegAffect + HappinessSD + HappinessCV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043602 -0.004566 0.000014 0.004584 0.043918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.620e-01 1.782e-01 2.593 0.00963 **
## year 2.022e-04 8.900e-05 2.272 0.02324 *
## LogGDPPC 5.655e-03 4.651e-04 12.159 < 2e-16 ***
## SocialSupport 1.556e-02 3.296e-03 4.722 2.61e-06 ***
## LifeExpectancy -6.359e-05 5.968e-05 -1.065 0.28688
## Freedom 4.671e-03 2.490e-03 1.876 0.06095 .
## Generosity 1.643e-02 1.878e-03 8.747 < 2e-16 ***
## AVGIsCorrupt -1.683e-02 1.731e-03 -9.720 < 2e-16 ***
## PosAffect -8.025e-03 3.555e-03 -2.258 0.02415 *
## NegAffect 9.400e-03 4.131e-03 2.275 0.02306 *
## HappinessSD 1.338e-01 1.387e-03 96.504 < 2e-16 ***
## HappinessCV -6.539e-01 5.533e-03 -118.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.008927 on 1231 degrees of freedom
## Multiple R-squared: 0.9785, Adjusted R-squared: 0.9783
## F-statistic: 5100 on 11 and 1231 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: bcAVGHappiness
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 0.00051 0.00051 6.3875 0.01162 *
## LogGDPPC 1 2.74204 2.74204 34411.1304 < 2.2e-16 ***
## SocialSupport 1 0.28368 0.28368 3559.9896 < 2.2e-16 ***
## LifeExpectancy 1 0.10336 0.10336 1297.1556 < 2.2e-16 ***
## Freedom 1 0.10696 0.10696 1342.3186 < 2.2e-16 ***
## Generosity 1 0.02470 0.02470 309.9553 < 2.2e-16 ***
## AVGIsCorrupt 1 0.00005 0.00005 0.5934 0.44125
## PosAffect 1 0.09366 0.09366 1175.3617 < 2.2e-16 ***
## NegAffect 1 0.00000 0.00000 0.0015 0.96943
## HappinessSD 1 0.00241 0.00241 30.2578 4.597e-08 ***
## HappinessCV 1 1.11285 1.11285 13965.6766 < 2.2e-16 ***
## Residuals 1231 0.09809 0.00008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual plots confirm that the residuals are closer to normal, although the QQ plot still appears to show pronounced tails. I believe that this is our optimal Box-Cox transformation. The r-squared value of .978 is higher than the .943 in fit1. According to the summary and anova for fit2, there are fewer significant predictors. This calls for elimination.
Removing year and LifeExpectancy from the model, we obtain:
##
## Call:
## lm(formula = bcAVGHappiness ~ LogGDPPC + HappinessCV + AVGIsCorrupt +
## HappinessSD + Generosity + SocialSupport + NegAffect + PosAffect +
## Freedom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.042786 -0.004732 -0.000019 0.004636 0.043602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8660068 0.0045631 189.787 < 2e-16 ***
## LogGDPPC 0.0054034 0.0003593 15.038 < 2e-16 ***
## HappinessCV -0.6510428 0.0053040 -122.745 < 2e-16 ***
## AVGIsCorrupt -0.0169166 0.0017290 -9.784 < 2e-16 ***
## HappinessSD 0.1335927 0.0013507 98.906 < 2e-16 ***
## Generosity 0.0162278 0.0018750 8.655 < 2e-16 ***
## SocialSupport 0.0155602 0.0032993 4.716 2.68e-06 ***
## NegAffect 0.0094530 0.0041213 2.294 0.0220 *
## PosAffect -0.0081013 0.0035526 -2.280 0.0228 *
## Freedom 0.0055490 0.0024647 2.251 0.0245 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00894 on 1233 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 6214 on 9 and 1233 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) LogGDPPC HappinessCV AVGIsCorrupt HappinessSD
## LogGDPPC -0.59
## HappinessCV -0.66 0.51
## AVGIsCorrupt -0.42 0.28 0.04
## HappinessSD 0.47 -0.40 -0.78 -0.17
## Generosity 0.00 0.15 0.01 0.14 0.11
## SocialSupport -0.19 -0.39 0.24 -0.17 -0.16
## NegAffect -0.17 -0.17 -0.02 -0.09 -0.26
## PosAffect -0.38 0.12 0.28 -0.03 -0.34
## Freedom -0.19 -0.01 0.03 0.36 -0.07
## Generosity SocialSupport NegAffect PosAffect
## LogGDPPC
## HappinessCV
## AVGIsCorrupt
## HappinessSD
## Generosity
## SocialSupport 0.05
## NegAffect -0.11 0.19
## PosAffect -0.31 -0.15 0.21
## Freedom -0.12 -0.12 0.05 -0.37
## Analysis of Variance Table
##
## Response: bcAVGHappiness
## Df Sum Sq Mean Sq F value Pr(>F)
## LogGDPPC 1 2.73074 2.73074 34165.4201 < 2.2e-16 ***
## HappinessCV 1 0.66606 0.66606 8333.3689 < 2.2e-16 ***
## AVGIsCorrupt 1 0.00119 0.00119 14.8716 0.000121 ***
## HappinessSD 1 1.06220 1.06220 13289.6457 < 2.2e-16 ***
## Generosity 1 0.00696 0.00696 87.1383 < 2.2e-16 ***
## SocialSupport 1 0.00143 0.00143 17.9428 2.446e-05 ***
## NegAffect 1 0.00056 0.00056 7.0365 0.008089 **
## PosAffect 1 0.00020 0.00020 2.4446 0.118183
## Freedom 1 0.00041 0.00041 5.0690 0.024533 *
## Residuals 1233 0.09855 0.00008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.866007 0.004563 189.786636 0.000000
## LogGDPPC 0.005403 0.000359 15.037713 0.000000
## HappinessCV -0.651043 0.005304 -122.744574 0.000000
## AVGIsCorrupt -0.016917 0.001729 -9.784107 0.000000
## HappinessSD 0.133593 0.001351 98.905659 0.000000
## Generosity 0.016228 0.001875 8.654876 0.000000
## SocialSupport 0.015560 0.003299 4.716281 0.000003
## NegAffect 0.009453 0.004121 2.293662 0.021978
## PosAffect -0.008101 0.003553 -2.280368 0.022756
## Freedom 0.005549 0.002465 2.251445 0.024533
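There was no discernible change in the r-squared value (.9785 to .9784, with the adjusted value unchanged at .9783), so dropping year and LifeExpectancy (neither appears in the formula above) gives a slightly simpler model at essentially no cost in fit.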
Removing Freedom, NegAffect, and PosAffect from the model, we obtain:
##
## Call:
## lm(formula = bcAVGHappiness ~ LogGDPPC + HappinessCV + AVGIsCorrupt +
## HappinessSD + Generosity + SocialSupport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041785 -0.004910 -0.000158 0.004655 0.044025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8650250 0.0039098 221.247 < 2e-16 ***
## LogGDPPC 0.0056856 0.0003499 16.251 < 2e-16 ***
## HappinessCV -0.6472603 0.0050277 -128.739 < 2e-16 ***
## AVGIsCorrupt -0.0173618 0.0015891 -10.925 < 2e-16 ***
## HappinessSD 0.1335793 0.0012221 109.304 < 2e-16 ***
## Generosity 0.0156480 0.0017253 9.070 < 2e-16 ***
## SocialSupport 0.0130972 0.0031064 4.216 2.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.008982 on 1236 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9781
## F-statistic: 9232 on 6 and 1236 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) LogGDPPC HappinessCV AVGIsCorrupt HappinessSD
## LogGDPPC -0.65
## HappinessCV -0.64 0.49
## AVGIsCorrupt -0.38 0.27 -0.02
## HappinessSD 0.34 -0.45 -0.80 -0.14
## Generosity -0.26 0.21 0.15 0.27 -0.07
## SocialSupport -0.38 -0.34 0.38 -0.08 -0.24
## Generosity
## LogGDPPC
## HappinessCV
## AVGIsCorrupt
## HappinessSD
## Generosity
## SocialSupport -0.05
## Analysis of Variance Table
##
## Response: bcAVGHappiness
## Df Sum Sq Mean Sq F value Pr(>F)
## LogGDPPC 1 2.73074 2.73074 33849.109 < 2.2e-16 ***
## HappinessCV 1 0.66606 0.66606 8256.217 < 2.2e-16 ***
## AVGIsCorrupt 1 0.00119 0.00119 14.734 0.0001301 ***
## HappinessSD 1 1.06220 1.06220 13166.607 < 2.2e-16 ***
## Generosity 1 0.00696 0.00696 86.332 < 2.2e-16 ***
## SocialSupport 1 0.00143 0.00143 17.777 2.665e-05 ***
## Residuals 1236 0.09971 0.00008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.865025 0.003910 221.247124 0.0e+00
## LogGDPPC 0.005686 0.000350 16.250739 0.0e+00
## HappinessCV -0.647260 0.005028 -128.739116 0.0e+00
## AVGIsCorrupt -0.017362 0.001589 -10.925292 0.0e+00
## HappinessSD 0.133579 0.001222 109.303823 0.0e+00
## Generosity 0.015648 0.001725 9.069687 0.0e+00
## SocialSupport 0.013097 0.003106 4.216243 2.7e-05
Removing these three predictors changed the model only marginally. The r-squared value went down by .0002 (.9784 to .9782), which is more than made up for by the simpler model. None of these three predictors was significant at the .01 level.
The residuals plot for fit4 is still a bit of a mess, and there isn’t much improvement in the model from manual elimination. However, the elimination process can be automated with stepwise selection.
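Dropping several predictors at once can be checked formally with a partial F-test on the nested models. The sketch below uses simulated data with two real predictors and two pure-noise columns; all names are hypothetical.

```r
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise1 + noise2, data = d)
reduced <- lm(y ~ x1 + x2, data = d)

# Partial F-test: H0 says the dropped coefficients are all zero.
cmp <- anova(reduced, full)
cmp

# Dropping terms can never reduce the residual sum of squares,
# so R-squared can only stay the same or fall.
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```

A large p-value in `cmp` supports the simpler model, which is a more direct justification than comparing the two r-squared values by eye.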
## Start: AIC=-11718.79
## bcAVGHappiness ~ year + LogGDPPC + SocialSupport + LifeExpectancy +
## Freedom + Generosity + AVGIsCorrupt + PosAffect + NegAffect +
## HappinessSD + HappinessCV
##
## Df Sum of Sq RSS AIC
## - LifeExpectancy 1 0.00009 0.09818 -11719.6
## <none> 0.09809 -11718.8
## - Freedom 1 0.00028 0.09837 -11717.2
## - PosAffect 1 0.00041 0.09850 -11715.7
## - year 1 0.00041 0.09850 -11715.6
## - NegAffect 1 0.00041 0.09850 -11715.6
## - SocialSupport 1 0.00178 0.09987 -11698.5
## - Generosity 1 0.00610 0.10419 -11645.8
## - AVGIsCorrupt 1 0.00753 0.10562 -11628.9
## - LogGDPPC 1 0.01178 0.10987 -11579.8
## - HappinessSD 1 0.74211 0.84020 -9051.1
## - HappinessCV 1 1.11285 1.21094 -8596.8
##
## Step: AIC=-11719.64
## bcAVGHappiness ~ year + LogGDPPC + SocialSupport + Freedom +
## Generosity + AVGIsCorrupt + PosAffect + NegAffect + HappinessSD +
## HappinessCV
##
## Df Sum of Sq RSS AIC
## <none> 0.09818 -11719.6
## + LifeExpectancy 1 0.00009 0.09809 -11718.8
## - Freedom 1 0.00029 0.09847 -11718.0
## - year 1 0.00037 0.09855 -11717.0
## - NegAffect 1 0.00039 0.09857 -11716.8
## - PosAffect 1 0.00039 0.09857 -11716.8
## - SocialSupport 1 0.00180 0.09998 -11699.1
## - Generosity 1 0.00602 0.10420 -11647.7
## - AVGIsCorrupt 1 0.00746 0.10564 -11630.6
## - LogGDPPC 1 0.01754 0.11572 -11517.3
## - HappinessSD 1 0.77895 0.87713 -8999.7
## - HappinessCV 1 1.19472 1.29290 -8517.4
## Start: AIC=-11719.64
## bcAVGHappiness ~ year + LogGDPPC + SocialSupport + Freedom +
## Generosity + AVGIsCorrupt + PosAffect + NegAffect + HappinessSD +
## HappinessCV
##
## Df Sum of Sq RSS AIC
## <none> 0.09818 -11719.6
## - Freedom 1 0.00029 0.09847 -11718.0
## - year 1 0.00037 0.09855 -11717.0
## - NegAffect 1 0.00039 0.09857 -11716.8
## - PosAffect 1 0.00039 0.09857 -11716.8
## - SocialSupport 1 0.00180 0.09998 -11699.1
## - Generosity 1 0.00602 0.10420 -11647.7
## - AVGIsCorrupt 1 0.00746 0.10564 -11630.6
## - LogGDPPC 1 0.01754 0.11572 -11517.3
## - HappinessSD 1 0.77895 0.87713 -8999.7
## - HappinessCV 1 1.19472 1.29290 -8517.4
## Analysis of Variance Table
##
## Model 1: bcAVGHappiness ~ year + LogGDPPC + SocialSupport + LifeExpectancy +
## Freedom + Generosity + AVGIsCorrupt + PosAffect + NegAffect +
## HappinessSD + HappinessCV
## Model 2: bcAVGHappiness ~ year + LogGDPPC + SocialSupport + Freedom +
## Generosity + AVGIsCorrupt + PosAffect + NegAffect + HappinessSD +
## HappinessCV
## Model 3: bcAVGHappiness ~ year + LogGDPPC + SocialSupport + Freedom +
## Generosity + AVGIsCorrupt + PosAffect + NegAffect + HappinessSD +
## HappinessCV
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1231 0.098092
## 2 1232 0.098182 -1 -9.0456e-05 1.1352 0.2869
## 3 1232 0.098182 0 0.0000e+00
The AIC algorithm only removed one predictor - LifeExpectancy. This is the same predictor that was initially removed in the manual backwards elimination. This method does not yield any notable improvement.
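For reference, a stepwise search of this kind can be run with `step()` from base R. This is a sketch on simulated data, not the report's call: the output above includes a `+ LifeExpectancy` row, which suggests a search in both directions, while the example below uses plain backward elimination.

```r
set.seed(11)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = d)

# Backward elimination; the default k = 2 is the AIC penalty, k = log(n) gives BIC.
sel_aic <- step(full, direction = "backward", trace = 0)
sel_bic <- step(full, direction = "backward", trace = 0, k = log(n))

formula(sel_aic)
formula(sel_bic)
```

Because the BIC penalty is heavier, `sel_bic` tends to keep fewer terms than `sel_aic`.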
## [1] 0.5977580 0.9654121 0.9730310 0.9763344 0.9778589 0.9781729 0.9782960
## [8] 0.9783992 0.9784448 0.9785080 0.9785278
## Subset selection object
## 11 Variables (and intercept)
## Forced in Forced out
## year FALSE FALSE
## LogGDPPC FALSE FALSE
## SocialSupport FALSE FALSE
## LifeExpectancy FALSE FALSE
## Freedom FALSE FALSE
## Generosity FALSE FALSE
## AVGIsCorrupt FALSE FALSE
## PosAffect FALSE FALSE
## NegAffect FALSE FALSE
## HappinessSD FALSE FALSE
## HappinessCV FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## year LogGDPPC SocialSupport LifeExpectancy Freedom Generosity
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " "*"
## 6 ( 1 ) " " "*" "*" " " " " "*"
## 7 ( 1 ) " " "*" "*" " " " " "*"
## 8 ( 1 ) "*" "*" "*" " " " " "*"
## 9 ( 1 ) "*" "*" "*" " " " " "*"
## 10 ( 1 ) "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*"
## AVGIsCorrupt PosAffect NegAffect HappinessSD HappinessCV
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" "*"
## 3 ( 1 ) " " " " " " "*" "*"
## 4 ( 1 ) "*" " " " " "*" "*"
## 5 ( 1 ) "*" " " " " "*" "*"
## 6 ( 1 ) "*" " " " " "*" "*"
## 7 ( 1 ) "*" " " "*" "*" "*"
## 8 ( 1 ) "*" " " "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*"
As expected, the highest \(r^2\) value is obtained by using all 11 predictors, since \(r^2\) cannot decrease when a predictor is added. This is similar to what we found before; however, it takes more than the \(r^2\) value to determine the best-fitting model.
##
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
##
## resolution
Plotted above is the number of variables included in a model against the best \(r^2\) for models with that many predictors. The plot shows little change in \(r^2\) between the best model with 2 predictors (0.965) and the best model with all 11 predictors (0.979), and a very small difference between 3 predictors (0.973) and 11.
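The flattening of the curve can be quantified directly from the \(r^2\) vector printed earlier. The sketch below copies those values and takes \(n = 1243\), inferred from the 1231 residual degrees of freedom plus 12 coefficients in the full model.

```r
# Best R-squared for each model size (1 to 11 predictors), copied from the output above.
r2 <- c(0.5977580, 0.9654121, 0.9730310, 0.9763344, 0.9778589, 0.9781729,
        0.9782960, 0.9783992, 0.9784448, 0.9785080, 0.9785278)
k <- seq_along(r2)
n <- 1243   # 1231 residual df + 12 coefficients in the full model

gain   <- c(NA, diff(r2))                        # improvement from each extra predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalises model size

round(data.frame(k, r2, gain, adj_r2), 4)
```

Past six predictors, each additional variable adds less than 0.0002 to \(r^2\).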
The two criteria that I normally use for selecting a model are AIC and BIC. In this case the two methods give noticeably different results: AIC gives the optimal number of predictors as 10, while BIC gives it as 6. For this data set, it makes the most sense to go with the BIC-recommended 6 predictors. The first reason is the general application of AIC vs BIC: AIC is generally preferred for prediction, while BIC is best used for explanation. The second reason is the charts above and their underlying data. The improvement between 6 and 10 predictors is very small in terms of RSS, \(r^2\), and Cp, so 6 predictors is preferable to avoid over-fitting and over-complicating the model.
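To illustrate why BIC settles on a smaller model, here is a brute-force best-subsets sketch in base R on simulated data. The report's own output appears to come from a dedicated subset-selection package, so this stands in for the idea rather than reproducing the method; all variable names are hypothetical.

```r
set.seed(5)
n <- 250
X <- as.data.frame(matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5))))
X$y <- 1 + 2 * X$x1 - 1.5 * X$x2 + 0.15 * X$x3 + rnorm(n)
preds <- paste0("x", 1:5)

# Best (lowest-RSS) model of each size, found by fitting every subset.
best_by_size <- lapply(seq_along(preds), function(k) {
  fits <- combn(preds, k, simplify = FALSE, FUN = function(vars) {
    lm(reformulate(vars, response = "y"), data = X)
  })
  fits[[which.min(sapply(fits, deviance))]]
})

aic <- sapply(best_by_size, AIC)
bic <- sapply(best_by_size, BIC)

# BIC's penalty per parameter is log(n) instead of 2, so for n >= 8
# it never prefers a larger model than AIC does.
c(aic_size = which.min(aic), bic_size = which.min(bic))
```

With \(n = 1243\) in the report, the BIC penalty per parameter is about 7.1 versus 2 for AIC, which is consistent with the gap between 10 and 6 predictors.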
This is the final model’s summary and ANOVA, with all two-way interactions among the six selected predictors:
##
## Call:
## lm(formula = bcAVGHappiness ~ (LogGDPPC + SocialSupport + Generosity +
## AVGIsCorrupt + HappinessSD + HappinessCV)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.047158 -0.001590 0.000424 0.002401 0.033954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6896418 0.0219452 31.426 < 2e-16 ***
## LogGDPPC 0.0232817 0.0023834 9.768 < 2e-16 ***
## SocialSupport 0.1534311 0.0216933 7.073 2.55e-12 ***
## Generosity 0.0029281 0.0187887 0.156 0.876180
## AVGIsCorrupt -0.0714945 0.0179016 -3.994 6.89e-05 ***
## HappinessSD 0.2329423 0.0094061 24.765 < 2e-16 ***
## HappinessCV -0.7530690 0.0336030 -22.411 < 2e-16 ***
## LogGDPPC:SocialSupport -0.0064215 0.0022552 -2.847 0.004482 **
## LogGDPPC:Generosity 0.0045935 0.0017721 2.592 0.009655 **
## LogGDPPC:AVGIsCorrupt 0.0054655 0.0013461 4.060 5.22e-05 ***
## LogGDPPC:HappinessSD -0.0090250 0.0008434 -10.701 < 2e-16 ***
## LogGDPPC:HappinessCV -0.0043521 0.0031009 -1.404 0.160715
## SocialSupport:Generosity -0.0209609 0.0159531 -1.314 0.189123
## SocialSupport:AVGIsCorrupt 0.0225054 0.0138143 1.629 0.103542
## SocialSupport:HappinessSD -0.0268466 0.0073304 -3.662 0.000261 ***
## SocialSupport:HappinessCV -0.1392408 0.0253547 -5.492 4.84e-08 ***
## Generosity:AVGIsCorrupt 0.0021679 0.0064995 0.334 0.738777
## Generosity:HappinessSD -0.0144780 0.0057592 -2.514 0.012068 *
## Generosity:HappinessCV -0.0009582 0.0227901 -0.042 0.966472
## AVGIsCorrupt:HappinessSD -0.0246732 0.0057242 -4.310 1.76e-05 ***
## AVGIsCorrupt:HappinessCV 0.1325817 0.0248166 5.342 1.09e-07 ***
## HappinessSD:HappinessCV 0.0554793 0.0037012 14.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00549 on 1221 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.9918
## F-statistic: 7159 on 21 and 1221 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: bcAVGHappiness
## Df Sum Sq Mean Sq F value Pr(>F)
## LogGDPPC 1 2.73074 2.73074 90593.7867 < 2.2e-16 ***
## SocialSupport 1 0.29058 0.29058 9640.2235 < 2.2e-16 ***
## Generosity 1 0.09041 0.09041 2999.3311 < 2.2e-16 ***
## AVGIsCorrupt 1 0.00993 0.00993 329.3238 < 2.2e-16 ***
## HappinessSD 1 0.00986 0.00986 327.2295 < 2.2e-16 ***
## HappinessCV 1 1.33707 1.33707 44358.0269 < 2.2e-16 ***
## LogGDPPC:SocialSupport 1 0.00141 0.00141 46.7197 1.290e-11 ***
## LogGDPPC:Generosity 1 0.00042 0.00042 13.7992 0.0002126 ***
## LogGDPPC:AVGIsCorrupt 1 0.00450 0.00450 149.3203 < 2.2e-16 ***
## LogGDPPC:HappinessSD 1 0.03819 0.03819 1267.0689 < 2.2e-16 ***
## LogGDPPC:HappinessCV 1 0.00016 0.00016 5.3481 0.0209109 *
## SocialSupport:Generosity 1 0.00003 0.00003 1.1298 0.2880290
## SocialSupport:AVGIsCorrupt 1 0.00095 0.00095 31.6683 2.266e-08 ***
## SocialSupport:HappinessSD 1 0.00678 0.00678 224.9875 < 2.2e-16 ***
## SocialSupport:HappinessCV 1 0.00047 0.00047 15.5563 8.464e-05 ***
## Generosity:AVGIsCorrupt 1 0.00014 0.00014 4.7420 0.0296255 *
## Generosity:HappinessSD 1 0.00223 0.00223 73.9867 < 2.2e-16 ***
## Generosity:HappinessCV 1 0.00001 0.00001 0.2867 0.5924088
## AVGIsCorrupt:HappinessSD 1 0.00008 0.00008 2.6926 0.1010700
## AVGIsCorrupt:HappinessCV 1 0.00075 0.00075 25.0333 6.459e-07 ***
## HappinessSD:HappinessCV 1 0.00677 0.00677 224.6920 < 2.2e-16 ***
## Residuals 1221 0.03680 0.00003
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the ANOVA table, all six main effects are significant at the 99.9% level; in the coefficient summary, all but Generosity (p = 0.876) are. Among the interactions, 9 are significant at the 99% level in the summary and 10 in the ANOVA table. Including interactions in the model increases the r-squared from 0.9782 to 0.9919, about 1.4 percentage points. This means that less than one percent of the variance in the transformed response is left unexplained by our model. This is impressive precision for survey data.
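The `(...)^2` operator in the model formula expands to every main effect plus every two-way interaction. The expansion can be inspected without any data:

```r
f <- bcAVGHappiness ~ (LogGDPPC + SocialSupport + Generosity +
                         AVGIsCorrupt + HappinessSD + HappinessCV)^2

# terms() expands the formula symbolically; no data frame is needed.
labs <- attr(terms(f), "term.labels")
length(labs)              # 6 main effects + choose(6, 2) = 15 interactions
labs[grepl(":", labs)]    # the two-way interaction terms
```

The 21 terms match the 21 numerator degrees of freedom in the F-statistic of the summary above.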
The Bonferroni outlier test flagged the following rows as outliers: 1238, 1216, 510, 1190, 156, 268, 549, 980, 127, 190, 652, 1167. The plots for the model support this conclusion as well.
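The idea behind a Bonferroni outlier test can be sketched in base R with studentized residuals. This is a hand-rolled illustration on simulated data with one planted outlier, not necessarily the exact routine used for the report.

```r
set.seed(9)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)
d$y[10] <- d$y[10] + 10          # plant one gross outlier in row 10

fit <- lm(y ~ x, data = d)

# Externally studentized residuals follow a t distribution with
# (residual df - 1) degrees of freedom under the model.
t_i   <- rstudent(fit)
p_raw <- 2 * pt(-abs(t_i), df = df.residual(fit) - 1)
p_bon <- pmin(1, p_raw * n)      # Bonferroni: multiply by the number of tests

flagged <- unname(which(p_bon < 0.05))
flagged

# Refit without the flagged rows; setdiff() is safe even if nothing is flagged.
d2   <- d[setdiff(seq_len(n), flagged), ]
fit2 <- lm(y ~ x, data = d2)
```

Removing flagged rows changes the data the model is judged on, so comparing fit statistics before and after removal should be done with that in mind.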
I created a new data set without these outliers and then refit the model on it, along with an analysis of the effects of their removal.
##
## Call:
## lm(formula = dataset2$AVGHappiness ~ (dataset2$LogGDPPC + dataset2$SocialSupport +
## dataset2$Generosity + dataset2$AVGIsCorrupt + dataset2$HappinessSD +
## dataset2$HappinessCV)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51708 -0.08159 -0.02685 0.05881 0.64660
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.06703 0.54810 0.122
## dataset2$LogGDPPC 0.61984 0.05947 10.422
## dataset2$SocialSupport 5.39338 0.54368 9.920
## dataset2$Generosity 0.92828 0.47802 1.942
## dataset2$AVGIsCorrupt -4.06829 0.44918 -9.057
## dataset2$HappinessSD 0.77512 0.23745 3.264
## dataset2$HappinessCV 3.97841 0.83976 4.738
## dataset2$LogGDPPC:dataset2$SocialSupport -0.45860 0.05665 -8.095
## dataset2$LogGDPPC:dataset2$Generosity 0.06282 0.04495 1.398
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt 0.32425 0.03359 9.652
## dataset2$LogGDPPC:dataset2$HappinessSD 0.12388 0.02098 5.903
## dataset2$LogGDPPC:dataset2$HappinessCV -1.80232 0.07843 -22.981
## dataset2$SocialSupport:dataset2$Generosity -0.98942 0.39808 -2.485
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.27729 0.34263 0.809
## dataset2$SocialSupport:dataset2$HappinessSD 1.22058 0.18874 6.467
## dataset2$SocialSupport:dataset2$HappinessCV -9.49431 0.65684 -14.455
## dataset2$Generosity:dataset2$AVGIsCorrupt 0.52137 0.16194 3.220
## dataset2$Generosity:dataset2$HappinessSD 0.33993 0.14789 2.299
## dataset2$Generosity:dataset2$HappinessCV -4.57229 0.61817 -7.396
## dataset2$AVGIsCorrupt:dataset2$HappinessSD -0.97531 0.14208 -6.865
## dataset2$AVGIsCorrupt:dataset2$HappinessCV 7.04366 0.62157 11.332
## dataset2$HappinessSD:dataset2$HappinessCV 0.78771 0.09719 8.105
## Pr(>|t|)
## (Intercept) 0.90269
## dataset2$LogGDPPC < 2e-16 ***
## dataset2$SocialSupport < 2e-16 ***
## dataset2$Generosity 0.05238 .
## dataset2$AVGIsCorrupt < 2e-16 ***
## dataset2$HappinessSD 0.00113 **
## dataset2$HappinessCV 2.42e-06 ***
## dataset2$LogGDPPC:dataset2$SocialSupport 1.39e-15 ***
## dataset2$LogGDPPC:dataset2$Generosity 0.16248
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt < 2e-16 ***
## dataset2$LogGDPPC:dataset2$HappinessSD 4.62e-09 ***
## dataset2$LogGDPPC:dataset2$HappinessCV < 2e-16 ***
## dataset2$SocialSupport:dataset2$Generosity 0.01307 *
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.41850
## dataset2$SocialSupport:dataset2$HappinessSD 1.45e-10 ***
## dataset2$SocialSupport:dataset2$HappinessCV < 2e-16 ***
## dataset2$Generosity:dataset2$AVGIsCorrupt 0.00132 **
## dataset2$Generosity:dataset2$HappinessSD 0.02170 *
## dataset2$Generosity:dataset2$HappinessCV 2.61e-13 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessSD 1.06e-11 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessCV < 2e-16 ***
## dataset2$HappinessSD:dataset2$HappinessCV 1.28e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1356 on 1209 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9855
## F-statistic: 3977 on 21 and 1209 DF, p-value: < 2.2e-16
The removal of the outliers made the \(r^2\) of the model increase (to 0.9857 for the untransformed response shown above); however, Generosity and all but one of its interactions (Generosity:HappinessCV) are no longer significant at the .001 level. Removing the predictor Generosity has an extremely small effect on the accuracy of the model, and for the sake of a concise model, I will continue without it.
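As a side note on the output above, the `dataset2$` prefix on every term comes from writing the columns as `dataset2$...` inside the formula. A sketch with a simulated stand-in for `dataset2` (only two predictors, made-up values) shows the equivalent call with a `data` argument:

```r
set.seed(2)
dataset2 <- data.frame(LogGDPPC = rnorm(50, 9, 1), SocialSupport = runif(50, 0.3, 1))
dataset2$AVGHappiness <- 0.5 * dataset2$LogGDPPC + 2 * dataset2$SocialSupport +
  rnorm(50, sd = 0.3)

# Same model written two ways.
fit_dollar <- lm(dataset2$AVGHappiness ~ (dataset2$LogGDPPC + dataset2$SocialSupport)^2)
fit_data   <- lm(AVGHappiness ~ (LogGDPPC + SocialSupport)^2, data = dataset2)

names(coef(fit_dollar))   # terms carry the "dataset2$" prefix
names(coef(fit_data))     # plain column names

all.equal(unname(coef(fit_dollar)), unname(coef(fit_data)))
```

The estimates are identical; the `data =` form gives shorter tables and lets `predict()` accept new data frames.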
##
## Call:
## lm(formula = dataset2$AVGHappiness ~ (dataset2$LogGDPPC + dataset2$SocialSupport +
## dataset2$AVGIsCorrupt + dataset2$HappinessSD + dataset2$HappinessCV)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43414 -0.08683 -0.02178 0.05827 0.76636
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.51409 0.54440 0.944
## dataset2$LogGDPPC 0.60118 0.05862 10.256
## dataset2$SocialSupport 5.79157 0.55249 10.483
## dataset2$AVGIsCorrupt -4.27329 0.46272 -9.235
## dataset2$HappinessSD 0.51708 0.24348 2.124
## dataset2$HappinessCV 2.98051 0.83668 3.562
## dataset2$LogGDPPC:dataset2$SocialSupport -0.49502 0.05518 -8.970
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt 0.33629 0.03468 9.697
## dataset2$LogGDPPC:dataset2$HappinessSD 0.12606 0.02164 5.826
## dataset2$LogGDPPC:dataset2$HappinessCV -1.68895 0.07933 -21.290
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.27056 0.35860 0.754
## dataset2$SocialSupport:dataset2$HappinessSD 1.43204 0.19589 7.310
## dataset2$SocialSupport:dataset2$HappinessCV -10.67899 0.66684 -16.014
## dataset2$AVGIsCorrupt:dataset2$HappinessSD -1.05878 0.15019 -7.049
## dataset2$AVGIsCorrupt:dataset2$HappinessCV 7.63523 0.64174 11.898
## dataset2$HappinessSD:dataset2$HappinessCV 1.03578 0.09936 10.424
## Pr(>|t|)
## (Intercept) 0.345185
## dataset2$LogGDPPC < 2e-16 ***
## dataset2$SocialSupport < 2e-16 ***
## dataset2$AVGIsCorrupt < 2e-16 ***
## dataset2$HappinessSD 0.033895 *
## dataset2$HappinessCV 0.000382 ***
## dataset2$LogGDPPC:dataset2$SocialSupport < 2e-16 ***
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt < 2e-16 ***
## dataset2$LogGDPPC:dataset2$HappinessSD 7.25e-09 ***
## dataset2$LogGDPPC:dataset2$HappinessCV < 2e-16 ***
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.450708
## dataset2$SocialSupport:dataset2$HappinessSD 4.81e-13 ***
## dataset2$SocialSupport:dataset2$HappinessCV < 2e-16 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessSD 3.00e-12 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessCV < 2e-16 ***
## dataset2$HappinessSD:dataset2$HappinessCV < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1449 on 1215 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9834
## F-statistic: 4862 on 15 and 1215 DF, p-value: < 2.2e-16
This is now the current model summary, without Generosity.
The Box-Cox plot suggests that a different transformation is needed than before.
##
## Call:
## lm(formula = bcAVGHappiness1 ~ (dataset2$LogGDPPC + dataset2$SocialSupport +
## dataset2$AVGIsCorrupt + dataset2$HappinessSD + dataset2$HappinessCV)^2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0248239 -0.0019158 0.0004211 0.0026994 0.0253266
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.7621036 0.0205659 37.057
## dataset2$LogGDPPC 0.0245117 0.0022144 11.069
## dataset2$SocialSupport 0.1575643 0.0208716 7.549
## dataset2$AVGIsCorrupt -0.1066589 0.0174806 -6.102
## dataset2$HappinessSD 0.2612937 0.0091980 28.408
## dataset2$HappinessCV -0.9019550 0.0316077 -28.536
## dataset2$LogGDPPC:dataset2$SocialSupport -0.0057653 0.0020847 -2.766
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt 0.0081577 0.0013101 6.227
## dataset2$LogGDPPC:dataset2$HappinessSD -0.0094438 0.0008174 -11.554
## dataset2$LogGDPPC:dataset2$HappinessCV -0.0122067 0.0029969 -4.073
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.0222482 0.0135470 1.642
## dataset2$SocialSupport:dataset2$HappinessSD -0.0295686 0.0074002 -3.996
## dataset2$SocialSupport:dataset2$HappinessCV -0.1569326 0.0251917 -6.230
## dataset2$AVGIsCorrupt:dataset2$HappinessSD -0.0319110 0.0056739 -5.624
## dataset2$AVGIsCorrupt:dataset2$HappinessCV 0.2014421 0.0242432 8.309
## dataset2$HappinessSD:dataset2$HappinessCV 0.0768856 0.0037537 20.483
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## dataset2$LogGDPPC < 2e-16 ***
## dataset2$SocialSupport 8.56e-14 ***
## dataset2$AVGIsCorrupt 1.41e-09 ***
## dataset2$HappinessSD < 2e-16 ***
## dataset2$HappinessCV < 2e-16 ***
## dataset2$LogGDPPC:dataset2$SocialSupport 0.00577 **
## dataset2$LogGDPPC:dataset2$AVGIsCorrupt 6.54e-10 ***
## dataset2$LogGDPPC:dataset2$HappinessSD < 2e-16 ***
## dataset2$LogGDPPC:dataset2$HappinessCV 4.94e-05 ***
## dataset2$SocialSupport:dataset2$AVGIsCorrupt 0.10079
## dataset2$SocialSupport:dataset2$HappinessSD 6.84e-05 ***
## dataset2$SocialSupport:dataset2$HappinessCV 6.44e-10 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessSD 2.31e-08 ***
## dataset2$AVGIsCorrupt:dataset2$HappinessCV 2.55e-16 ***
## dataset2$HappinessSD:dataset2$HappinessCV < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005475 on 1215 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.9939
## F-statistic: 1.34e+04 on 15 and 1215 DF, p-value: < 2.2e-16
The new Box-Cox transformation of the dependent variable (λ = -0.675) raises the R-squared value to 0.994.
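As a rough illustration of the transformation, the sketch below applies the standard Box-Cox form, (y^λ - 1) / λ, and its inverse. Using this form rather than a bare power is an assumption, but it is consistent with the transformed values printed later in this report (roughly 0.92 to 1.06), which a bare power of -0.675 could not produce for scores between 1 and 10. The function names and the example scores are hypothetical.

```r
# Box-Cox transform for a fixed lambda (standard form, assumed here).
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, to move values back to the original 1-10 scale.
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- -0.675
ladder <- c(3, 4.5, 6, 7.5)   # illustrative scores, not taken from the dataset
z      <- box_cox(ladder, lambda)
back   <- inv_box_cox(z, lambda)

print(round(z, 4))
print(all.equal(back, ladder))  # round trip recovers the original scores
```

In this form the transformation is increasing in y even for a negative λ, so a higher transformed value still corresponds to a happier country.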
The residuals appear to be greatly improved from the original model. For reference, below is the original linear model with no transformations, outlier omissions, or predictor omissions.
The vast majority of the curvature was removed from the plots through the transformation and selection process. There is still a large divergence in the Normal Q-Q plot, but this may be explained later.
There appears to be heteroskedasticity in the model: the residuals look smaller at the high end of the fitted values. There is potentially some shape to the Studentized Deleted Residuals, but I cannot determine what it is from the plot alone.
Added-variable plots give a clearer picture of how each independent variable affects the regression model. The more closely the points in a plot follow its regression line, the stronger the linear relationship between that term and the response, given everything else in the model. Most of the plots show little linear effect on the regression model; however, a few appear quite linear. Prime examples are HappinessCV, LogGDPPC:HappinessSD, and LogGDPPC:HappinessCV.
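For readers unfamiliar with how an added-variable plot is built, here is a minimal sketch on simulated data (the variables x1, x2, and y are hypothetical and unrelated to the report's dataset). Each plot regresses the response on all other terms, regresses the term of interest on all other terms, and plots the two sets of residuals against each other. The slope of that plot equals the term's coefficient in the full model.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)              # deliberately correlated with x1
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)

# Added-variable construction for x2:
ry <- resid(lm(y ~ x1))    # response with the other predictor removed
rx <- resid(lm(x2 ~ x1))   # x2 with the other predictor removed
av_fit <- lm(ry ~ rx)

av_slope   <- unname(coef(av_fit)["rx"])
full_slope <- unname(coef(full)["x2"])
print(c(added_variable = av_slope, full_model = full_slope))
```

Calling `plot(rx, ry)` followed by `abline(av_fit)` would draw the plot itself. A tight scatter around the line indicates a strong partial linear relationship.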
The box plot of the residuals appears to show that the residuals are close to normal.
The residuals of each year appear to be roughly centered around zero; however, the residuals of the latest four years appear to show increased variance when compared to the previous years. This indicates potential heteroskedasticity.
##
## studentized Breusch-Pagan test
##
## data: fit02
## BP = 282.8, df = 15, p-value < 2.2e-16
The purpose of this test is to check for heteroskedasticity in the linear regression model. The null hypothesis is that the model is homoskedastic, and the alternative hypothesis is that it is heteroskedastic. At a significance level of 0.01 (99% confidence), the p-value of < 2.2e-16 lets us reject the null hypothesis.
We therefore conclude that there is heteroskedasticity in the linear regression model: the variance of the errors is not independent of the predictors.
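The mechanics of the studentized Breusch-Pagan statistic can be sketched in base R. The example below uses simulated data (not the report's dataset) in which the error spread grows with the predictor. It regresses the squared residuals on the predictors and compares n times the auxiliary R-squared to a chi-squared distribution. This is a hand-rolled illustration of the studentized (Koenker) form, assumed to match the test reported above, and not the code that produced that output.

```r
set.seed(42)
n <- 500
x <- runif(n, 1, 5)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error spread increases with x

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the model's predictors.
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

bp_stat <- n * summary(aux)$r.squared     # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1          # number of predictors in the model
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 2), " df =", bp_df, " p-value =", signif(bp_p, 3), "\n")
```

The degrees of freedom equal the number of non-intercept terms, which is why the test on the final model reports df = 15.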
Reviewing the plot above supports the finding that there is heteroskedasticity in the model.
There is no visible curvature, and the points of the plot are all relatively close to the regression line. This means that the regression model is likely a good fit for the data.
The plot above takes three different statistics into account: hat values (the leverage of each point, taken from the diagonal of the hat matrix) are on the x axis, studentized residuals are on the y axis, and the size of each circle reflects the point's overall influence on the fit (Cook's distance, if this is the standard influence plot). The larger a data point's circle, the greater its influence on the regression model. The plot shows that the most influential points have studentized residuals above 2 or below -2.
The three preceding plots all tell us essentially the same thing. The vast majority of the data points have little influence on the regression model. Because so many points lie extremely close to their fitted values, the couple dozen points that fit the model worse have only a small effect, even with their higher leverage. These plots suggest that some points may be reducing the model's precision, but the R-squared value is high enough that no action is needed.
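The three quantities discussed here can be computed directly with base R. The sketch below uses simulated data with one deliberately ill-fitting, high-leverage point. The cutoffs (hat value above 2p/n, absolute studentized residual above 2) are common rules of thumb assumed for illustration, not thresholds taken from this report.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6
y[n] <- -5                      # one planted high-leverage, ill-fitting point

fit <- lm(y ~ x)

infl <- data.frame(
  hat      = hatvalues(fit),        # leverage
  rstudent = rstudent(fit),         # studentized deleted residuals
  cooks    = cooks.distance(fit)    # overall influence on the fit
)

p <- length(coef(fit))
flagged <- which(infl$hat > 2 * p / n | abs(infl$rstudent) > 2)
most_influential <- which.max(infl$cooks)

print(infl[flagged, ])
```

A point needs both leverage and a large residual to move the fit much, which is why Cook's distance combines the two.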
The following parameters are included in the final regression model:
AVGHappiness^-.675 : This is the dependent variable after a Box-Cox transformation with λ = -0.675. This variable represents the average national happiness level from a survey.
LogGDPPC : This is the log-transformed Gross Domestic Product per capita for the country in the year of the survey. It is the single predictor most strongly correlated with the dependent variable (0.78 with untransformed AVGHappiness). There is a positive relationship between LogGDPPC and a nation’s (transformed) happiness, with an estimated main-effect coefficient of 0.0245117.
SocialSupport : There is a positive relationship between the amount of social support a country’s people, on average, feel they receive from their peers and the average happiness level in that country. The model estimates the main-effect coefficient on the transformed response to be 0.1575643.
AVGIsCorrupt : The more corrupt a nation’s populace perceives their government to be, the less happy they are. The model estimates the main-effect coefficient on the transformed response to be -0.1066589.
HappinessSD : This is the standard deviation of the happiness responses within the country for the year in question. The model estimates the main-effect coefficient on the transformed response to be 0.2612937.
HappinessCV : This is the coefficient of variation (standard deviation divided by the mean) of the happiness responses within the country for the year in question. The model estimates the main-effect coefficient on the transformed response to be -0.9019550, the largest in magnitude of any main effect. A regression coefficient is not a correlation, so this value does not by itself measure how linear the relationship is; the correlation between HappinessCV and untransformed AVGHappiness is -0.75. The negative sign indicates that the more relative variation there is in the happiness of a country’s population, the lower the average happiness tends to be.
There are 10 two-way interactions between the predictors in the final model. Of these, only one (SocialSupport:AVGIsCorrupt, p = 0.10079) is not significant at the 0.01 level. This means that the effect of almost every predictor on the transformed response depends on the values of the other predictors.
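Because the model includes interactions, the main-effect coefficients describe each predictor's slope only when the other predictors equal zero. The sketch below shows how the slope for LogGDPPC could be evaluated at other values, using the coefficients from the model summary above. The function name and the predictor values passed to it are hypothetical and chosen only for illustration.

```r
# LogGDPPC main effect and its interaction coefficients, copied from the summary.
b <- c(main          =  0.0245117,
       SocialSupport = -0.0057653,
       AVGIsCorrupt  =  0.0081577,
       HappinessSD   = -0.0094438,
       HappinessCV   = -0.0122067)

# Slope of the transformed response with respect to LogGDPPC,
# evaluated at given values of the other predictors.
gdp_slope <- function(SocialSupport, AVGIsCorrupt, HappinessSD, HappinessCV) {
  unname(b["main"] +
         b["SocialSupport"] * SocialSupport +
         b["AVGIsCorrupt"]  * AVGIsCorrupt +
         b["HappinessSD"]   * HappinessSD +
         b["HappinessCV"]   * HappinessCV)
}

# Hypothetical predictor values; only HappinessSD differs between the two calls.
s_low  <- gdp_slope(SocialSupport = 0.8, AVGIsCorrupt = 0.7,
                    HappinessSD = 1.5, HappinessCV = 0.3)
s_high <- gdp_slope(SocialSupport = 0.8, AVGIsCorrupt = 0.7,
                    HappinessSD = 2.5, HappinessCV = 0.3)
print(c(s_low = s_low, s_high = s_high))
```

Raising HappinessSD by one unit lowers the LogGDPPC slope by the size of the LogGDPPC:HappinessSD coefficient, which is what a significant interaction means in practice.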
## Warning in predict.lm(fit02, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
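I used the sample function to randomly select 3 rows from the model’s predictions and will analyze them below. The first table gives the fitted values with what appear to be the 95% prediction intervals (the wider bounds), and the second gives the same fitted values with the 95% confidence intervals for the mean response.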
## fit lwr upr
## 985 1.0566736 1.0458618 1.0674853
## 477 0.9238774 0.9131079 0.9346469
## 208 1.0412161 1.0304497 1.0519826
## fit lwr upr
## 985 1.0566736 1.0554488 1.0578984
## 477 0.9238774 0.9231097 0.9246451
## 208 1.0412161 1.0404930 1.0419392
## # A tibble: 3 x 8
## country year TransformedAVGHappiness LogGDPPC SocialSupport
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Colombia 2016 1.050715 9.486231 0.8819004
## 2 Armenia 2013 0.926007 8.919060 0.7232600
## 3 Russia 2014 1.041289 10.121573 0.9317554
## # ... with 3 more variables: AVGIsCorrupt <dbl>, HappinessSD <dbl>,
## # HappinessCV <dbl>
Colombia (2016): The response (1.050715) is within the 95% prediction interval (1.0458618 to 1.0674853), and the range of that interval is very small, which shows us that the model is accurate with respect to this point. The response does fall just below the much narrower 95% confidence interval for the mean (1.0554488 to 1.0578984). The squared error (SE) for this point is 3.5506237 × 10^-5.
Armenia (2013): The response (0.926007) is within the 95% prediction interval (0.9131079 to 0.9346469) but lies just above the 95% confidence interval for the mean (0.9231097 to 0.9246451). The SE for this point is 4.5351012 × 10^-6.
Russia (2014): The response (1.041289) is within both the 95% prediction interval (1.0304497 to 1.0519826) and the 95% confidence interval for the mean (1.0404930 to 1.0419392), which shows us that the model is very accurate with respect to this point. The SE for this point is 5.2478077 × 10^-9.
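The interval checks and squared errors for the three sampled rows can be reproduced from the printed output alone. The sketch below copies the values from the tables above. It assumes that rows 985, 477, and 208 correspond to Colombia, Armenia, and Russia in that order, and that the wider table holds the prediction intervals. The column names are my own.

```r
rows <- data.frame(
  country  = c("Colombia", "Armenia", "Russia"),
  observed = c(1.050715, 0.926007, 1.041289),      # TransformedAVGHappiness
  fit      = c(1.0566736, 0.9238774, 1.0412161),
  pi_lwr   = c(1.0458618, 0.9131079, 1.0304497),   # wider table (assumed prediction)
  pi_upr   = c(1.0674853, 0.9346469, 1.0519826),
  ci_lwr   = c(1.0554488, 0.9231097, 1.0404930),   # narrower table (assumed confidence)
  ci_upr   = c(1.0578984, 0.9246451, 1.0419392)
)

rows$in_pi  <- with(rows, observed >= pi_lwr & observed <= pi_upr)
rows$in_ci  <- with(rows, observed >= ci_lwr & observed <= ci_upr)
rows$sq_err <- with(rows, (observed - fit)^2)

print(rows[, c("country", "in_pi", "in_ci", "sq_err")])
```

The squared errors computed this way agree with the reported SE values up to the rounding of the printed responses.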